Abstract

Background: Genomic selection can be implemented by a multi-step procedure, which requires a response 
variable and a statistical method. For pure-bred pigs, it was hypothesised that deregressed estimated breeding 
values (EBV) with the parent average removed as the response variable generate higher reliabilities of genomic 
breeding values than EBV, and that the normal, thick-tailed and mixture-distribution models yield similar reliabilities. 

Methods: Reliabilities of genomic breeding values were estimated with EBV and deregressed EBV as response 
variables and under the three statistical methods, genomic BLUP, Bayesian Lasso and MIXTURE. The methods were 
examined by splitting data into a reference data set of 1375 genotyped animals that were performance tested 
before October 2008, and 536 genotyped validation animals that were performance tested after October 2008. The 
traits examined were daily gain and feed conversion ratio. 

Results: Using deregressed EBV as the response variable yielded 18 to 39% higher reliabilities of the genomic 
breeding values than using EBV as the response variable. For daily gain, the increase in reliability due to 
deregression was significant and approximately 35%, whereas for feed conversion ratio it ranged between 18 and 
39% and was significant only when MIXTURE was used. Genomic BLUP, Bayesian Lasso and MIXTURE had similar 
reliabilities. 

Conclusions: Deregressed EBV is the preferred response variable, whereas the choice of statistical method is less 
critical for pure-bred pigs. The increase of 18 to 39% in reliability is worthwhile, since the reliabilities of the 
genomic breeding values directly affect the returns from genomic selection. 



Background 

Genomic selection in pure-bred pigs can be implemented
using a multi-step procedure. Effects of dense genetic
markers are estimated using a reference population and
these effects are used to predict genomic breeding
values (GBV) of selection candidates [1]. Implementing
a multi-step procedure relies on two prerequisites: 1) a
response variable that summarises the genetic
information for reference animals, and 2) a statistical
method that associates the response variable with the
marker information. The choice of response variable
and statistical method may well depend on the data
structure. Therefore, the challenge is to find a suitable
response variable and statistical method that can handle
pure-bred pig data.

Pure-bred pig data often have low and varying
reliabilities of estimated breeding values (EBV). Some
Duroc pigs in the Danish breeding scheme only have their
own records, others only offspring records, while some
have both, depending on the trait. Furthermore, only a
subset of pigs is genotyped and these pigs tend to be
closely related. Therefore, a response variable and a
statistical method capable of handling such data are
needed.

Ostersen et al. Genetics Selection Evolution 2011, 43:38
http://www.gsejournal.org/content/43/1/38

© 2011 Ostersen et al; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly cited.

Among the possible response variables, EBV and
deregressed EBV are the most promising to date.
Garrick et al. [2] showed, at least in theory, that
deregressed EBV with the parent average removed
(hereafter referred to as "deregressed EBV") yield more
accurate genomic
breeding values than EBV for two reasons. First,
deregressed EBV as the response variable results in less
double-counting compared to EBV, because the proposed
deregressed EBV excludes ancestral information. If both
an offspring and its parent are genotyped, the degree of
double-counting decreases when using deregressed EBV
as the response variable. Second, using EBV as the
response variable results in double shrinkage of the
genomic breeding values, particularly when the
reliabilities of the EBV are low. In dairy cattle, Guo et
al. [3] compared the two response variables on simulated
data and found that deregressed EBV yielded slightly
lower reliabilities. However, their simulated data were
characterized by low degrees of double-counting. Most
pure-bred pig data, which have EBV with low
reliabilities and high degrees of double-counting
compared to the data of Guo et al. [3], appear to be more
in line with the theoretical expectations. This suggests
that deregressed EBV as the response variable may yield
higher reliabilities of genomic breeding values than EBV.

Three types of statistical models have been widely used
in the literature [1,4-14]. These are models that assume
the marker effects to be normally distributed, models
that assume a thick-tailed distribution of marker effects,
and models that assume a mixture of two distributions.
The performance of the different models is
predominantly affected by the number of QTL, the marker
density and the genetic relatedness of the population.
However, there is no unambiguous evidence that one
model will yield more accurate genomic breeding values
for pure-bred pig data for two reasons. First,
mixture-distribution models have shown promising results
for some traits, in particular traits under weak
selection, possibly because these traits can be
influenced to a greater extent by a few large QTL [9,15].
Second, low genetic relatedness between reference
animals and validation animals appears to favour
mixture-distribution models [14], presumably because
mixture-distribution models utilize linkage
disequilibrium more efficiently than normal and
thick-tailed distribution models. Probably for the same
reason, Hayes et al. [9] concluded that thick-tailed
models and models assuming normality were equally good
only when data from a single cattle breed were analysed.
It appears that for pure-bred pig data, for which there is
strong selection on traits of interest and high relatedness
between genotyped animals, none of the models would
be favoured over the others. Therefore, the models
assuming mixture-distribution, thick-tailed distribution
and normal distribution might yield similar reliabilities of
genomic breeding values.

In summary, we reasoned that 1) deregressed EBV as
the response variable yields higher reliabilities of the
GBV compared to EBV, and that 2) normal, thick-tailed
and mixture-distribution models yield similar reliabilities.
To test these hypotheses, the three models and the two
response variables were assessed for the reliabilities by
which they could predict GBV for the two traits, daily
gain and feed conversion ratio, in Danish Duroc pigs.

Methods 

Procedure 

Two response variables, EBV and deregressed EBV, and 
three statistical methods with normal, thick-tailed, and 
mixture-distributions, were assessed for their reliability 
to predict GBV for daily gain and feed conversion ratio 
in Danish Duroc pigs. The reliabilities were computed 
by splitting the genotyped animals into 1375 reference 
animals that were performance tested before October 
2008 and 536 validation animals that were performance 
tested after October 2008 (further details are given 
below). 

Data 

The Duroc pigs were part of the genetic evaluation
system in Denmark. All data were supplied by the Danish
Agriculture and Food Council, Pig Research Centre.
Genotyping

A total of 1911 Danish Duroc pigs were genotyped
using the Illumina PorcineSNP60 BeadChip (Illumina,
San Diego, CA). A total of 26142 SNP markers and all of
the animals met the following requirements. Each
animal had a call rate greater than 0.95. Each marker
was mapped to an autosome, had a minor-allele
frequency greater than 0.05, a call-frequency score greater
than 0.95, and a heterozygote frequency that did not
deviate from Hardy-Weinberg expectations by more
than $\frac{1}{4}\sqrt{pq}$, where p and q are the allele
frequencies at the marker. The value $\frac{1}{4}\sqrt{pq}$
corresponds to 1/4 of a standard-deviation unit when
assuming a binomial distribution of alleles. For each
animal and SNP-genotype combination, the GenCall score
was required to be greater than 0.65; genotypes with a
GenCall score below 0.65 were defined as missing.
Animals with missing genotypes were allocated the
population mean for the missing markers.
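
The marker-level filters described above can be illustrated with a minimal sketch. It assumes genotypes coded -1, 0, 1 (the coding used later for X) with None for missing calls. The function names are hypothetical, and the per-marker call rate is used here as a stand-in for the Illumina call-frequency score, which is an assumption rather than the authors' exact criterion.

```python
import math

def snp_passes_qc(genotypes, min_maf=0.05, min_call_rate=0.95):
    """Return True if one marker passes the filters sketched here.

    genotypes: list with values -1, 0, 1 (0 = heterozygote), None = missing.
    """
    called = [g for g in genotypes if g is not None]
    # Stand-in for the call-frequency score: fraction of animals called.
    if not called or len(called) / len(genotypes) <= min_call_rate:
        return False
    n = len(called)
    # Allele frequency from allele counts (genotype + 1 = count of one allele).
    p = sum(g + 1 for g in called) / (2 * n)
    q = 1.0 - p
    if min(p, q) <= min_maf:
        return False
    # Heterozygote frequency versus the Hardy-Weinberg expectation 2pq,
    # with a tolerance of 1/4 * sqrt(pq).
    het_observed = sum(1 for g in called if g == 0) / n
    het_expected = 2 * p * q
    return abs(het_observed - het_expected) <= 0.25 * math.sqrt(p * q)

def impute_mean(genotypes):
    """Replace missing genotypes by the mean of the called genotypes."""
    called = [g for g in genotypes if g is not None]
    mean = sum(called) / len(called)
    return [mean if g is None else g for g in genotypes]
```

A marker with genotype counts 25/50/25 passes all three filters, whereas a marker for which every animal is heterozygous fails the Hardy-Weinberg filter.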
Performance test and pedigree 

Daily gain (g/day) and feed conversion ratio (feed units/ 
kg gain) were observed in the interval 30-100 kg live 
weight. Recordings were performed in the period 1992 
to 2010. The pedigrees of the animals were traced back 
to 1984, and consisted of 345686 and 52537 animals for 
daily gain and feed conversion ratio, respectively. Both 
pedigrees included 373 unknown parents (base animals). 

The reference data consisted of available records on
October 1st 2008 and included 313068 and 23628
measurements of daily gain and feed conversion ratio,
respectively. All of the 1375 genotyped animals in the
reference data had their own records for daily gain,
whereas only 898 genotyped animals had their own
records for feed conversion ratio. There were 680
genotyped animals in the reference data that had more
than four offspring records for daily gain and 633
genotyped animals that had more than four offspring
records for feed conversion ratio.

The full data consisted of records available until May
2010 and included the records from the reference data
and an additional 32618 and 3344 measurements of daily
gain and feed conversion ratio, respectively. All 536
genotyped validation animals were phenotyped for both
daily gain and feed conversion ratio after October 1st,
2008.

Response variables 

The EBV were calculated for daily gain and feed
conversion ratio by single-trait animal models, and
deregressed EBV were calculated by applying the
procedure proposed by Garrick et al. [2].

The single-trait animal models were based on the
routine evaluation model, with the regression effect of
start weight, fixed effect of herd-week-section, random
effect of pen and a random additive genetic effect. The
model fitted to daily gain also included a fixed effect for
sex and a random effect for birth litter. The variance
components were estimated using REML. The heritabilities
for daily gain and feed conversion ratio were 0.27 and
0.21, respectively. The 1375 reference animals had mean
reliabilities for EBV of 0.62 (sd = 0.18) and 0.36
(sd = 0.12) for daily gain and feed conversion ratio,
respectively. The software DMU [16] was used to estimate
variances and predict the breeding values. Single-trait
animal models were used because preliminary analyses
showed that predictions of GBV were not improved with a
bivariate model.

The deregression procedure of Garrick et al. [2] adjusts
for ancestral information, such that the deregressed EBV
of each animal contains only the animal's own
information and that of its descendants. The deregression
also eliminates shrinkage contained in the EBV, and
therefore deregressed EBV behave as though they were
observations with a heritability equal to the reliability
of the deregressed EBV (reliabilities computed as in
Garrick et al. [2]). Deregressed EBV have unequal
variances and should be used in a weighted analysis. To
ensure the quality of deregressed EBV, only animals with
a deregressed reliability above 0.05 were included in the
analysis. This resulted in 35 animals being removed from
the analyses with deregressed EBV for feed conversion
ratio.
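
The source does not restate the deregression formulas, so the following is only a sketch of one way the procedure can be computed. It assumes the two-equation system for the parent average (PA) and the individual that underlies the Garrick et al. [2] approach, with lambda = (1 - h2)/h2, and solves it for the information content that is free of PA information. All function and argument names are hypothetical, and the inputs (EBV, PA and their reliabilities) are assumed to be available from the genetic evaluation.

```python
import math

def information_content(r2_i, r2_pa, h2):
    """Return (zpa, zi): information on the parent average and on the
    individual excluding PA, given the EBV reliability r2_i (< 1) and the
    PA reliability r2_pa (< 0.5)."""
    lam = (1 - h2) / h2
    alpha = 1 / (0.5 - r2_pa)
    delta = (0.5 - r2_pa) / (1 - r2_i)
    zpa = lam * (0.5 * alpha - 4) + 0.5 * lam * math.sqrt(alpha ** 2 + 16 / delta)
    zi = delta * zpa + 2 * lam * (2 * delta - 1)
    return zpa, zi

def deregress(ebv, pa, r2_i, r2_pa, h2):
    """Return (deregressed EBV, reliability of the deregressed EBV)."""
    lam = (1 - h2) / h2
    _, zi = information_content(r2_i, r2_pa, h2)
    # Right-hand side for the individual after removing PA information.
    y_star = -2 * lam * pa + (zi + 2 * lam) * ebv
    debv = y_star / zi
    r2_debv = 1 - lam / (zi + lam)
    return debv, r2_debv
```

For example, with h2 = 0.27, an EBV reliability of 0.62 and a PA reliability of 0.30 (illustrative values, not taken from the study), the reliability of the deregressed EBV is lower than that of the EBV, because the ancestral information has been removed. Animals whose information comes almost entirely from the PA give a zi close to zero, which corresponds to the low deregressed reliabilities that were excluded from the analysis.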

Statistical methods to estimate marker effects 

The marker effects were estimated by fitting linear,
additive models to the response variables. The model
was:

$$Y = \mu + X\beta + \varepsilon \qquad (1)$$

where Y is an n x 1 vector of responses, with n being
the number of reference animals. The mean is denoted
$\mu$, which is a scalar. The coefficient matrix for the
markers, X, takes the values -1, 0 or 1, and has dimension
n x p, where p is the number of markers. The vector of
marker effects is denoted $\beta$ and has dimension p x 1,
whereas the vector of residuals, $\varepsilon$, has dimension
n x 1, and is distributed $\varepsilon \sim N(0, \sigma_\varepsilon^2 W)$,
where W is a diagonal matrix with elements $w_1, \ldots, w_n$.

For the analyses with EBV as the response variable,
$w_1 = \ldots = w_n = 1$, whereas for the analyses with
deregressed EBV as the response variable, a weighted
analysis was performed according to Garrick et al. [2].
The weight for the ith animal was
$w_i = (1 - h^2)/[(c + (1 - r_i^2)/r_i^2)h^2]$, where
c, the part of the genetic variance not explained by
markers, was assumed to be 0.1, $h^2$ was the heritability
of the trait, and $r_i^2$ was the reliability of the
deregressed EBV of the ith animal.
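
As a worked illustration of the weight formula, the following sketch evaluates it directly. The function name is hypothetical, and the example reliabilities are illustrative rather than taken from the data.

```python
def debv_weight(r2, h2, c=0.1):
    """Weight of one deregressed EBV.

    r2: reliability of the animal's deregressed EBV (0 < r2 <= 1)
    h2: heritability of the trait
    c:  part of the genetic variance not explained by markers
    """
    return (1 - h2) / ((c + (1 - r2) / r2) * h2)

# Daily gain (h2 = 0.27): a more reliable deregressed EBV gets more weight.
w_low = debv_weight(0.30, 0.27)
w_high = debv_weight(0.80, 0.27)
```

Because the term (1 - r2)/r2 shrinks as the reliability increases, the weight rises with reliability but stays bounded by (1 - h2)/(c h2), so c limits the influence of very accurately evaluated animals.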

The statistical models studied differ in the
distributional assumptions made on $\beta$, and for brevity
they are referred to by the corresponding inferential
procedure. A brief description of each method is
provided in the following.
GBLUP 

The model for genomic BLUP (GBLUP) can be
described as in equation (1), where the vector of marker
effects is distributed $\beta \sim N(0, \sigma_\beta^2 I)$.
Alternatively, the model could be written as
$Y = \mu + g + \varepsilon$, where the genetic effects
$g \sim N(0, \sigma_\beta^2 XX^T)$, i.e. a model with a
genomic relationship matrix [7]. In this study, the latter
form was used, where $XX^T$ for computational reasons
was replaced by $(X - P)(X - P)^T$ with P containing the
row means of X across animals. Variance components
were estimated using REML, and the GBV were BLUP
solutions from the mixed model equations. Software
DMU [16] was used to fit GBLUP models.
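
A small sketch of the centred cross-product (X - P)(X - P)^T may help. It assumes that P holds, for every marker, the mean genotype across animals (one reading of the description above); the function name is hypothetical and no scaling of the matrix is applied.

```python
def centered_relationship(X):
    """Return (X - P)(X - P)^T for a genotype matrix X given as a list of
    rows (animals) with entries -1, 0 or 1 (markers in columns)."""
    n, p = len(X), len(X[0])
    # Assumed content of P: the mean of each marker across animals.
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Z = [[row[j] - means[j] for j in range(p)] for row in X]
    return [[sum(a * b for a, b in zip(Z[i], Z[k])) for k in range(n)]
            for i in range(n)]

# Three animals, three markers (toy data).
G = centered_relationship([[-1, 0, 1], [0, 0, 1], [1, 1, -1]])
```

With this centring every row of the resulting matrix sums to zero, so the matrix is singular by construction and relationships are expressed relative to the genotyped animals themselves.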
Bayesian Lasso 

The model for Bayesian Lasso assumes that the marker
effects follow a double exponential distribution [12,17],
which is a thick-tailed distribution. This is also referred
to as a Laplace distribution. Thus,
$\beta \sim \mathrm{Laplace}(0, \lambda I)$,
where $\lambda$ is a scaling parameter. The MCMC software
BayZ [18] was used for estimation. A Gibbs sampler was
applied and a total of 40000 iterations of sampling was
performed, with a burn-in of 10000 iterations.
MIXTURE 

The model for MIXTURE assumes that the marker
effects follow a mixture of two normal distributions
with similar characteristics as the mixture model in
Meuwissen [19]. The distributional assumption can be
described as:
$\beta \sim \pi_0 N(0, \sigma_0^2 I) + \pi_1 N(0, \sigma_1^2 I)$,
where $\pi_0 = 0.9$, $\pi_1 = 0.1$, and
$\sigma_0^2 = 0.01 \cdot \sigma_1^2$. The MCMC
software BayZ [18] was used for estimation. A
Metropolis-Hastings sampler was applied with a total of
40000 iterations of sampling and a burn-in of 10000
iterations.
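
To illustrate what the two-component prior implies, the sketch below computes, for a given marker effect, the probability that it belongs to the large-variance component under the prior alone. This is not the sampler implemented in BayZ; the function names and the value of the large variance are assumptions made for illustration.

```python
import math

def normal_pdf(x, var):
    """Density of N(0, var) at x."""
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

def prob_large_component(beta, var_large, pi_large=0.1, ratio=0.01):
    """P(effect comes from the large-variance component | beta), using
    pi_0 = 1 - pi_large and sigma_0^2 = ratio * sigma_1^2."""
    var_small = ratio * var_large
    small = (1 - pi_large) * normal_pdf(beta, var_small)
    large = pi_large * normal_pdf(beta, var_large)
    return large / (small + large)

# With sigma_1^2 = 1 (arbitrary units): an effect of zero is almost surely
# "small", whereas an effect of one standard deviation is almost surely "large".
p_at_zero = prob_large_component(0.0, 1.0)
p_at_one = prob_large_component(1.0, 1.0)
```

The sharp switch between the two components is one way to see why a mixture prior can help when a few large QTL are present, and why it offers little when effects are uniformly small.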

Evaluation criterion 

The predictive ability of each method was evaluated by
the reliability of the GBV on the genotyped validation
animals. To avoid use of overlapping information
between the reference and validation animals [20], we
based the validation on the own phenotype of genotyped
validation animals, adjusted for fixed effects and
non-genetic random effects. The computation of adjusted
phenotypes for genotyped validation animals was based
on the full data, and the adjusted phenotypes were the
sum of EBV and the estimated residual errors
$(\hat{g} + \hat{e})$.

The reliability of the GBV, $r^2_{(GBV)}$, is the squared
correlation $r^2_{(GBV,g)}$ between GBV and the vector of
genetic effects, g, for genotyped validation animals.
Using that the vector of residual errors, e, for these
animals is independent of the GBV, and that
$g + e \approx \hat{g} + \hat{e}$, we obtain a
formula for the reliability of the GBV

$$r^2_{(GBV)} = r^2_{(GBV,g)} = r^2_{(GBV,\,g+e)} \cdot \omega \approx r^2_{(GBV,\,\hat{g}+\hat{e})} \cdot \omega,$$

where $\hat{g}$ is the vector of EBV for validation animals,
$\hat{e}$ is the estimated residual vector,
$\omega = (\sigma_g^2 + \sigma_e^2)/\sigma_g^2$, $\sigma_e^2$ is
the residual variance and $\sigma_g^2$ is the additive
genetic variance of the trait. The formula above for the
reliability is similar to other formulas shown in the
literature [14,21].
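
The reliability formula can be sketched in a few lines: the squared correlation between GBV and adjusted phenotypes is scaled by omega. The function names and the toy numbers are hypothetical; the variance components would come from the animal model.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def gbv_reliability(gbv, adjusted_phenotype, var_g, var_e):
    """Squared correlation with adjusted phenotypes (EBV + residual),
    scaled by omega = (var_g + var_e) / var_g."""
    omega = (var_g + var_e) / var_g
    return pearson(gbv, adjusted_phenotype) ** 2 * omega

# Toy example with five validation animals (illustrative values only).
rel = gbv_reliability([1, 2, 3, 4, 5], [2, 1, 4, 3, 5], var_g=4.0, var_e=1.0)
```

The scaling matters because the adjusted phenotype contains residual noise: without omega, the squared correlation would understate the reliability by the factor sigma_g^2/(sigma_g^2 + sigma_e^2).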

A test was performed to investigate whether the
reliabilities of the GBV from two methods were
significantly different from each other within each
trait. The test hypothesis was that the reliability of
GBV from method A was equal to the reliability of GBV
from method B, i.e.
$H_0: r^2_{(GBV_A,\,\hat{g}+\hat{e})} \cdot \omega = r^2_{(GBV_B,\,\hat{g}+\hat{e})} \cdot \omega$.
Since $\omega$ is simply a constant for each trait, the test
hypothesis can be reduced to
$H_0: r^2_{(GBV_A,\,\hat{g}+\hat{e})} = r^2_{(GBV_B,\,\hat{g}+\hat{e})}$.
However, $\hat{g} + \hat{e}$ appears in both correlations and
the two correlations are not independent within each
trait. This issue of testing equality of correlated
correlations was addressed by the Hotelling-Williams
t-test [22,23], which was applied to each trait with a
significance level of 5%.
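
The test statistic for two correlations that share one variable can be sketched as follows. The formula is the Williams modification of Hotelling's test as commonly given in the statistical literature; whether the authors used exactly this variant is an assumption, and the function name and example correlations are hypothetical. The p-value would be obtained from a t distribution with n - 3 degrees of freedom using a statistics library.

```python
import math

def williams_t(r_ay, r_by, r_ab, n):
    """Test statistic for H0: corr(A, y) = corr(B, y).

    r_ay, r_by: correlations of the two GBV sets with the adjusted phenotype
    r_ab:       correlation between the two GBV sets
    n:          number of validation animals
    Returns (t, degrees of freedom).
    """
    # Determinant of the 3 x 3 correlation matrix of (y, A, B).
    det = 1 - r_ay ** 2 - r_by ** 2 - r_ab ** 2 + 2 * r_ay * r_by * r_ab
    r_bar = (r_ay + r_by) / 2
    denom = 2 * det * (n - 1) / (n - 3) + r_bar ** 2 * (1 - r_ab) ** 3
    t = (r_ay - r_by) * math.sqrt((n - 1) * (1 + r_ab) / denom)
    return t, n - 3

# Illustrative correlations for 536 validation animals.
t_stat, df = williams_t(0.30, 0.26, 0.90, 536)
```

The correlation between the two sets of GBV enters the statistic directly: the more similar the two predictions are, the smaller the difference in correlation that can be declared significant.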

Results 

Response variables 

Using deregressed EBV as the response variable resulted
in higher reliabilities than using EBV for both feed
conversion ratio and daily gain (Table 1). For daily gain,
the reliabilities were on average 35% higher for
deregressed EBV with GBLUP, Bayesian Lasso and MIXTURE.
The reliabilities for daily gain increased from
approximately 0.25 to 0.34. For feed conversion ratio, the
reliabilities were on average 18% higher for deregressed
EBV for GBLUP and Bayesian Lasso, which approached
significance (p-values between 0.05 and 0.11). Use of
deregressed EBV instead of EBV increased reliabilities
from approximately 0.16 to 0.19 for GBLUP and Bayesian
Lasso. For the method MIXTURE, the deregressed EBV
yielded 39% higher reliabilities than EBV, increasing
reliabilities from 0.15 to 0.20.

Statistical methods 

The three statistical methods, GBLUP, Bayesian Lasso 
and MIXTURE did not yield different reliabilities (Table 
1). For daily gain, the reliabilities ranged from 0.33 to 
0.34 for deregressed EBV and from 0.25 to 0.26 for 
EBV. For feed conversion ratio, the reliabilities ranged 
from 0.19 to 0.20 for deregressed EBV and from 0.15 to 
0.16 for EBV. 

Discussion 

The hypothesis that deregressed EBV yield higher
reliabilities than EBV was supported. The increase in
reliability of 18 to 39% for Duroc pigs is worthwhile,
since the reliabilities of the genomic breeding values
directly affect the returns from genomic selection. The
hypothesis that the different statistical methods yielded
similar reliabilities was also supported. Therefore, we
believe that when applying genomic selection within a
pure-bred pig population, deregressed EBV is the preferred
response variable, while the choice of the statistical
method is less critical.

The fact that the increase in reliabilities due to
deregression was so large was surprising, since results
reported by Guo et al. [3] suggested that EBV is the
more suitable response variable. There are three possible
reasons. First, in our study less information was
available for the reference animals compared to Guo et
al. [3], which implies that the expected double shrinkage
was increased when using EBV as the response variable.
Second, the amount of information was more heterogeneous
among the reference animals, and this favours deregressed
EBV. The heterogeneity in amount of information was larger
for daily gain than for feed conversion ratio, which also
explains why the effect of deregression was larger for
this trait. Third, the combination of less information per
reference animal and several parent-offspring
relationships within the reference population increases
double-counting further. If a genotyped parent has only
one offspring, and this offspring is genotyped, the degree
of double-counting is much larger than if both the parent
and the offspring have many ungenotyped offspring. So, the
combination of sire-son relationships with a low and
heterogeneous amount of information in the reference
data seems to favour deregressed EBV over EBV.

Table 1 Reliabilities of GBV, $r^2_{(GBV)}$, based on the three
statistical methods and the two response variables for
daily gain and feed conversion ratio.

                            EBV        Deregressed EBV
Daily gain†
  GBLUP                     0.26^a     0.33^b
  Bayesian Lasso            0.25^a     0.34^b
  MIXTURE                   0.25^a     0.34^b
Feed conversion ratio†
  GBLUP                     0.16^ab    0.19^ab
  Bayesian Lasso            0.16^ab    0.19^ab
  MIXTURE                   0.15^a     0.20^b

† Reliabilities within daily gain and feed conversion ratio with different
superscripts are significantly different (P < 0.05). Reliabilities for daily gain and
feed conversion ratio are not comparable.

There are two reasons why the three statistical methods
performed equally well. First, the traits have been subject
to strong selection, which suggests that MIXTURE would
not have an advantage over the other methods [9,15].
Second, we considered only one breed, in which the
genotyped animals were closely related, implying that
Bayesian Lasso and MIXTURE, which utilize linkage
disequilibrium more efficiently, have no advantage over
GBLUP [9,14]. Thus, the small difference between the
statistical methods is caused by the pure-bred pig data
with its high relatedness between genotyped animals and
traits that have been subject to strong selection.

The results from this study may only partly apply to
other species, traits and data structures. The positive
effect of using deregressed EBV as the response variable
will depend on the degree of double-counting, and the
amount and heterogeneity of information for genotyped
animals. However, since the deregressed EBV proposed
by Garrick et al. [2] are theoretically more appropriate
than EBV, we believe that deregressed EBV will be
advantageous in most circumstances. In contrast, the
conclusion about equal performances of the statistical
methods is not generic. For traits controlled by large
QTL, data from multiple breeds or distantly related
genotyped animals, or when using a denser marker panel,
MIXTURE and Bayesian Lasso could outperform
GBLUP. The reason for this is that MIXTURE and
Bayesian Lasso utilize linkage disequilibrium more
efficiently. We believe that the positive effect of using
deregressed EBV applies to most situations, whereas
choosing the best statistical method is more sensitive to
the particular situation.

The applied multi-step method has the benefit of
being readily applicable in most breeding schemes. EBV
with reliabilities are easily predicted, and can then be
used to calculate deregressed EBV [2]. The deregression
procedure itself is very simple and requires little
computing power. The association of marker effects
with the response variable by GBLUP is relatively
simple, since GBLUP is implemented in common breeding
value estimation software. Bayesian Lasso and MIXTURE
are available in various MCMC software packages,
although MIXTURE in particular is sensitive to the prior
information given, which makes the use of these methods
less appealing. A final step, which has not been
examined in the present study, would be to combine the
genomic breeding values with the traditional pedigree
information [24,25].